Abstract - Efficient algorithms for asynchronous multiprocessor systems must achieve
a balance between low process communication and high adaptability to variations in
process speed. Algorithms which employ problem decomposition can be classified as
static and dynamic. Static and dynamic algorithms are particularly suited for low
process communication and high adaptability, respectively. In order to find the "best"
method, something about mean execution times must be known. Techniques for the
analysis of the mean execution time are developed for each type of algorithm, including
applications of order statistics and queueing theory. These techniques are applied in
detail to (1) static generalizations of quicksort, (2) static generalizations of merge sort,
and (3) a dynamic generalization of quicksort.
1 - Introduction


We consider the design and analysis of k-process algorithms for an
asynchronous multiprocessor system, which consists of k or more processors sharing a
common memory by means of a switch or connecting network. In addition there is an
operating system providing such functions as process creation, scheduling of
processes, allocation of memory, synchronization, etc. A real example of such a system
is described in [7], and a general discussion of asynchronous parallel algorithms is
presented in [5]. A k-process algorithm will be presented by giving the procedure
each process executes when assigned a processor. We will assume that a processor is
always available for any of the k processes that is runnable.

Given a task we wish to execute on such a system, in order to exploit parallelism
we must decompose the task into a set of subtasks. Some subtasks cannot begin until
others which they depend upon finish; this establishes a precedence relation between
tasks. Inefficiency in an algorithm arises when some process must spend too much
time waiting for other processes to complete subtasks, and again towards the end of
execution when there are fewer than k subtasks. Attempts to remedy this by "evenly"
dividing the original task are hopeless, since task execution time will vary due to
variations in the input, the effects of other users, properties of the operating system,
processor-memory interference, and many other causes. Any efficient algorithm must
adapt to these variations. However, this adaptation is expensive, in that it requires
process communication. Thus the trade-off between adaptability and process
communication must be considered in the design of multiprocessor algorithms. In the
algorithms considered in this paper, process communication takes place by means of
global data accessible by all processes. Since in many cases access to this global data
must be confined to a critical section, one cause of process communication overhead is
the interference between processes seeking access to this global data.

Two methods of decomposition naturally arise: (1) static decomposition, in which
the set of subtasks and their precedence relations are known before execution, and (2)
dynamic decomposition, in which the set of subtasks changes during execution. Static
decomposition algorithms offer the possibility of very low process communication,
providing there are not too many tasks; however, their adaptability is limited. Dynamic
decomposition algorithms can adapt to variations in task execution time very well, but
only at the expense of high process communication.


Given a problem which can be decomposed into subproblems, which method is
best? Is the extra expense necessary for fast process communication (thus supporting
efficient dynamic algorithms) justified? If a dynamic algorithm is used, how far should
decomposition proceed? In order to answer these questions we need techniques for
finding mean execution times for these types of algorithms.


In section 2, algorithms employing static decomposition are considered. We
develop techniques for finding the probability distribution of total execution time in
terms of the distributions of individual task execution times, and when these are not
known, techniques for finding bounds on the mean execution time. In section 3, the
mean execution time for a simple model of a dynamic algorithm is found, assuming
exponentially distributed task execution times. In sections 4 and 5 the results of
section 2 are applied to static generalizations of quicksort and merge sort. Certain
partitioning strategies are shown to be unsuitable for a static decomposition version of
quicksort. In addition, a parallel merging algorithm is presented and analyzed. In
section 6 a dynamic generalization of quicksort is presented. Using a result of section
3, the mean execution time is found, and an expression for the optimal degree of
decomposition is derived. Section 7 contains a summary of the main results.


2 - Static  Decomposition  Algorithms 


Given a set of tasks $T_1, T_2, \ldots, T_n$ ordered by a precedence relation $<$, we
call $T_i$ a predecessor of $T_j$ ($T_j$ a successor of $T_i$) if $T_i < T_j$. If there is no task $U$ such
that $T_i < U < T_j$, $T_i$ is said to be an immediate predecessor of $T_j$ ($T_j$ an immediate
successor of $T_i$). Tasks with no predecessors are called initial, and tasks with no
successors are called final. In the execution of the static algorithm, each process does
the following:

(1) Select either an initial task or a task all of whose predecessors have
been completed, which has not already been selected. Check in the order
$T_1, T_2, \ldots, T_n$.

(2) If no task can be selected, go to sleep, unless all tasks have already
been selected, in which case terminate. When awakened go to (1).

(3) Execute the selected task.

(4) For each immediate successor of the task, record that an immediate
predecessor has completed, and wake up a sleeping process if possible.

(5) Repeat from (1).


For the purposes of analysis we assume that steps (1), (2), (4), and (5) take zero
time, and that the execution time of task $T_i$ is given by the random variable $t_i$, with
cumulative distribution function (c.d.f.) $F_i$.
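The five steps, under the zero-overhead assumption just stated, can be sketched as a small event-driven simulation. This is only an illustrative sketch: the function name `simulate_static`, the 0-based task indexing, and the representation of the precedence relation as a list of predecessor sets are assumptions made here, not part of the original report.

```python
import heapq

def simulate_static(times, preds, k):
    """Total execution time of the static algorithm with k processes.

    times[i] -- execution time of task i (tasks are checked in index order)
    preds[i] -- set of predecessors of task i (immediate ones are enough)
    Steps (1), (2), (4) and (5) are assumed to take zero time.
    The task-graph is assumed to be acyclic.
    """
    n = len(times)
    done, selected = set(), set()
    running = []          # heap of (finish_time, task)
    now, idle = 0.0, k
    while len(done) < n:
        # Step (1): every idle process selects the first ready task.
        for i in range(n):
            if idle == 0:
                break
            if i not in selected and preds[i] <= done:
                selected.add(i)
                heapq.heappush(running, (now + times[i], i))
                idle -= 1
        # Steps (3)-(4): advance to the next task completion.
        now, finished = heapq.heappop(running)
        done.add(finished)
        idle += 1
    return now
```

For three independent tasks and k = 2, the simulated total is max(min(t1, t2) + t3, max(t1, t2)), since the process that finishes first picks up the third task.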

Definition - The task-graph G associated with $T_1, T_2, \ldots, T_n$ and $<$ is a directed graph with
nodes $T_1, T_2, \ldots, T_n$ and arrows from $T_i$ to $T_j$ if $T_i$ is an immediate predecessor of $T_j$.

Note  that  there  is  a one-to-one  correspondence  between  partially  ordered  sets 
of  tasks  and  task-graphs. 

Definition - G is a chain if the tasks are totally ordered.

The length of a chain is the number of tasks in the chain. If in a chain the initial
task is $T_i$ and the final task is $T_j$ we say it is a chain from $T_i$ to $T_j$. A sub-graph of a
task-graph G which is a chain is said to be a chain in G.

Definition  - The  level  of  a task  T in  a task-graph  G is  the  maximum  length  of  any  chain 
in  G from  an  initial  task  to  T.  The  depth  of  G is  the  maximum  level  of  any  task. 

Definition - A set of tasks is independent if for any tasks $T_i$, $T_j$ in the set, neither $T_i < T_j$
nor $T_j < T_i$. The width of a task-graph is the maximum size of any independent subset
of tasks.
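The definitions of level, depth, and width can be made concrete with a short sketch. The helper names and the list-of-predecessor-sets encoding are assumptions made for illustration, and the width is found by brute force, which is only sensible for very small task-graphs.

```python
from itertools import combinations

def levels(preds):
    """Level of each task: longest chain (counted in tasks) ending at it."""
    memo = {}
    def level(i):
        if i not in memo:
            memo[i] = 1 + max((level(p) for p in preds[i]), default=0)
        return memo[i]
    return [level(i) for i in range(len(preds))]

def depth(preds):
    """Depth of the task-graph: the maximum level of any task."""
    return max(levels(preds))

def ancestors(preds):
    """All predecessors (not only immediate ones) of every task."""
    memo = {}
    def anc(i):
        if i not in memo:
            result = set()
            for p in preds[i]:
                result |= {p} | anc(p)
            memo[i] = result
        return memo[i]
    return [anc(i) for i in range(len(preds))]

def width(preds):
    """Size of the largest independent set of tasks (brute force)."""
    anc = ancestors(preds)
    n = len(preds)
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if all(a not in anc[b] and b not in anc[a]
                   for a, b in combinations(subset, 2)):
                return size
    return 0
```

As a usage example, a binary tree of one root task, two children, and four grandchildren has levels 1, 2, 2, 3, 3, 3, 3, depth 3, and width 4.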


Given a task-graph G, let $t_G$ be the random variable representing total execution
time (the time from when all processes are started until the last process terminates).
Assume $t_G$ has c.d.f. $F_G$. In the following definition a class of task-graphs is defined
for which $F_G$ can be expressed simply in terms of the $F_i$.

Definition - Let $C_1, C_2, \ldots, C_m$ be all chains from initial to final tasks in G. For each chain
$C_i$ containing tasks $T_{j_1}, T_{j_2}, \ldots$, let $E_i$ be the expression $(x_{j_1} x_{j_2} \cdots)$, where $x_1, x_2, \ldots$ are
polynomial variables. Then G is said to be simple if the polynomial $E_1 + E_2 + \cdots + E_m$ can
be factored so that each variable appears exactly once (see figure 2.1).
Corollary - If $k \ge \mathrm{width}(G)$, the $t_i$ are independent, $\mathrm{depth}(G) = L$, and the $m_j$ tasks on level j
have identically distributed execution times with mean $\mu_j$ and standard deviation $\sigma_j$,
then

[equation (2.3) is missing in the source]
Let $w = \mathrm{width}(G)$. When $w > k$, $F_G$ cannot in general be expressed simply in terms of
the $F_i$, even when G is simple and the $t_i$ are independent. For example, let G consist of
$T_1, T_2, T_3$ independent, and let $k = 2$. Then

\[ t_G = \max(\min(t_1, t_2) + t_3, \max(t_1, t_2)), \]

and $t_G$ cannot be simplified further.


When $w > k$, the lower bounds for $E(t_G)$ given above still hold. For an upper
bound we take the following approach. It is assumed that w processes are created,
and each process has a processor available at least k/w of the time. For example, the
bound given in the corollary becomes

[equation not recoverable from the source] (2.4)


Finally, when the $t_i$ are dependent, in general special techniques must be used,
such as those in the analysis of partitioning strategies (section 4) or parallel merging
(section 5).
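As an illustration of how order statistics can bound a mean execution time, the sketch below uses two facts that hold independently of this report: when $k \ge \mathrm{width}(G)$, the total time is at most the sum over levels of the longest task on each level (every chain contains at most one task per level), and the expected maximum of m independent, identically distributed variables with mean $\mu$ and standard deviation $\sigma$ is at most $\mu + \sigma (m-1)/\sqrt{2m-1}$, a classical order-statistics bound. Whether this coincides exactly with the bound derived in this section is an assumption; the function names are hypothetical.

```python
from fractions import Fraction
from math import sqrt

def harmonic(m):
    """H_m = 1 + 1/2 + ... + 1/m as an exact fraction."""
    return sum(Fraction(1, j) for j in range(1, m + 1))

def expected_max_bound(mean, sd, m):
    """Classical upper bound on E[max of m i.i.d. variables]."""
    return mean + sd * (m - 1) / sqrt(2 * m - 1)

def level_sum_bound(level_stats):
    """Upper bound on mean total time when k >= width(G).

    level_stats -- list of (mean, sd, number_of_tasks), one entry per level.
    Assumes independent, identically distributed task times within a level.
    """
    return sum(expected_max_bound(mu, sd, m) for mu, sd, m in level_stats)
```

A quick sanity check uses exponential task times with mean 1 (so standard deviation 1), for which the expected maximum of m tasks is exactly $H_m$; `float(harmonic(m))` never exceeds `expected_max_bound(1.0, 1.0, m)`.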


3 - A Dynamic Decomposition Algorithm


Given a task T and a procedure which decomposes a task into two tasks which
may be executed concurrently, we consider the following dynamic algorithm. First,
there is a decomposition phase, in which each process repeatedly removes a task from
the task-queue TQ (which initially contains only T), decomposes the task and inserts
the two new tasks in TQ, until there is a total of M tasks. Next, there is an execution
phase, in which each process repeatedly removes a task from TQ and executes it.

We analyze this algorithm under the following assumptions:


(1) In this section the time to access TQ is assumed to be 0.

(2) The time to decompose a task is assumed to be exponentially
distributed with mean $d_i^{-1}$, where i is the current total number of tasks.

(3) The time to execute a task is assumed to be exponentially distributed
with mean $e^{-1}$.

We  use  standard  queueing  theory  techniques  in  the  analysis  (see  for  example 
[3]).  Adopting  as  a state  variable  the  total  number  of  tasks  in  TQ  or  currently  being 
executed  or  decomposed,  the  state-transition-rate  diagram  is  given  by  figure  3.1. 

Figure 3.1 (state-transition-rate diagram)
The mean execution time is found to be:

\[ T = \frac{1}{e}\left(\frac{M}{k} + H_k - 1\right) + \sum_{1 \le i \le M-1} \frac{1}{\min(i,k)\, d_i} \tag{3.1} \]

where $H_k = 1 + 1/2 + 1/3 + \cdots + 1/k$.
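The mean execution time of this model can be cross-checked numerically. The sketch below assumes the birth-death reading of the queueing model: with i tasks in the decomposition phase, min(i, k) processes decompose at rate $d_i$ each, and with j tasks left in the execution phase, min(j, k) processes execute at rate e each, so the mean time is the sum of the mean holding times of the states. It compares that sum with the closed form $(M/k + H_k - 1)/e$ for the execution phase, which requires $M \ge k$. Function names are hypothetical.

```python
from fractions import Fraction

def mean_time_by_states(M, k, e_rate, d_rate):
    """Sum of mean holding times of all states of the model.

    d_rate(i) -- decomposition rate when there are i tasks in total
    e_rate    -- execution rate of a single task
    """
    decomposition = sum(Fraction(1) / (min(i, k) * d_rate(i))
                        for i in range(1, M))
    execution = sum(Fraction(1) / (min(j, k) * e_rate)
                    for j in range(1, M + 1))
    return decomposition + execution

def mean_time_closed_form(M, k, e_rate, d_rate):
    """Closed form of the same quantity, valid for M >= k."""
    H_k = sum(Fraction(1, j) for j in range(1, k + 1))
    execution = (Fraction(M, k) + H_k - 1) / e_rate
    decomposition = sum(Fraction(1) / (min(i, k) * d_rate(i))
                        for i in range(1, M))
    return execution + decomposition
```

Exact fractions are used so that the two expressions can be compared for equality rather than within a tolerance.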
4 - Static Quicksort


We  consider  a static  generalization  of  quicksort  as  given  by  the  task-graph  of 
figure  4.1  (see  [6]  for  a complete  discussion  of  sequential  quicksort): 

Figure 4.1

The tasks may be described as follows:

(1) $P_{1,1}$ is a partition of the file to be sorted.

(2) $P_{i,j}$ (j odd) is a partition of the left subfile produced by $P_{i-1,(j+1)/2}$.

(3) $P_{i,j}$ (j even) is a partition of the right subfile produced by $P_{i-1,j/2}$.

(4) $S_j$ (j odd) is a quicksort of the left subfile produced by $P_{L-1,(j+1)/2}$.

(5) $S_j$ (j even) is a quicksort of the right subfile produced by $P_{L-1,j/2}$.

First consider the simplest case, where k is a power of 2 and $L = 1 + \lg(k)$ (where
lg is $\log_2$). In this case the width of the task graph is k. The question arises as to
what partitioning strategy to use, that is, how should the partitioning element be
selected in the P tasks? First a definition of asymptotic mean speedup:

Definition - Given an algorithm for k processes, let the mean total execution time be
$T_k(N)$, where N is the size of the input; then the asymptotic mean speedup is
defined to be

\[ S_k = \lim_{N \to \infty} \frac{T_1(N)}{T_k(N)}. \]

We would prefer a partitioning strategy which gives asymptotic mean speedup of k
even in the simplest case; strategies which depend on large L for speedup are
unsuitable since the number of tasks increases exponentially with L, and one of the
main advantages of static algorithms is low overhead.

It is now necessary to make some assumptions about the execution times of
tasks. In the sequential analysis of quicksort it is found that partitioning a file of size
N takes O(N) time with standard deviation O(N), and that sorting a file of size N takes
O(N lg(N)) time with standard deviation O(N) (see [6]). Thus in analyzing asymptotic
mean speedup it is only necessary to consider the sorting task times.

(1) When the partitioning element for a partition of a file of size M is selected at
random, it is natural to assume that either subfile size is uniformly distributed between
0 and M. This, together with the fact that the sum of the subfile sizes is M, gives an
expected maximum subfile size of 3M/4. Using this, it is easy to show that of the k
subfiles to be sorted in the sorting tasks, the expected maximum subfile size is at least
$(3/4)^{\lg(k)} N$, which implies $S_k \le (4/3)^{\lg(k)} = k^{\lg(4/3)}$.

(2) If the median of three method is used to select the partitioning element, and
if it is assumed that the final position of each of the three elements in the subfile is
uniformly distributed between 0 and M, then the probability density function for the
size of either subfile is:

\[ f(x) = \frac{6x}{M}\left(1 - \frac{x}{M}\right)\frac{1}{M}. \]

This gives an expected maximum subfile size of 11M/16. As in (1), it can be shown
that the expected maximum size of the subfiles to be sorted is larger than
$(11/16)^{\lg(k)} N$. It follows that $S_k \le k^{\lg(16/11)}$.

(3) If the partitioning elements for all partitioning tasks are found using the
method of samplesort (first pick k-1 elements randomly, sort, and use these for the k-1
P tasks), and if the final position of each of the k-1 elements is assumed to be
uniformly distributed between 0 and N, then the probability density function for the
size of the largest subfile to be sorted is:

\[ f(x) = \sum_{1 \le j \le \lfloor N/x \rfloor} (-1)^{j-1}\, k(k-1) \binom{k-1}{j-1} \left(1 - \frac{jx}{N}\right)^{k-2} \frac{1}{N}. \]

(See the discussion on the random division of an interval in [2]). It follows the
expected maximum size of the subfiles to be sorted is:

\[ \int_0^N x f(x)\, dx = \sum_{1 \le j \le k} (-1)^{j-1} \binom{k-1}{j-1} \frac{N}{j^2} = \frac{H_k}{k} N. \]

Hence $S_k \le k/H_k$.

(4) Finally we turn to the partitioning strategy of first finding the median (in
O(M) time, where M is the size of the subfile) in each P task, and using the median as
the partitioning element. This does give $S_k = k$, but it should be noted that median
finding represents a large overhead. Unless process communication is extremely
expensive, a dynamic generalization of quicksort (such as the one presented in section
6) is probably better.
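The expected maximum subfile sizes quoted for the first three strategies can be checked with exact arithmetic. The sketch below works in units of the file size and assumes the uniform-position model stated above; the function names are hypothetical. For a random partitioning element the position density is 1, for median of three it is 6u(1-u), and for samplesort the alternating sum over binomial coefficients is compared with $H_k/k$.

```python
from fractions import Fraction
from math import comb

def integrate_poly(coeffs, lo, hi):
    """Exact integral over [lo, hi] of sum(coeffs[n] * u**n)."""
    return sum(c * (hi ** (n + 1) - lo ** (n + 1)) / (n + 1)
               for n, c in enumerate(coeffs))

HALF, ONE = Fraction(1, 2), Fraction(1)

def expected_max_random():
    """E[max(u, 1-u)] for u uniform on [0, 1]: 2 * integral of u on [1/2, 1]."""
    return 2 * integrate_poly([0, 1], HALF, ONE)

def expected_max_median_of_three():
    """E[max(u, 1-u)] for density 6u(1-u): 2 * integral of 6u^2 - 6u^3."""
    return 2 * integrate_poly([0, 0, 6, -6], HALF, ONE)

def expected_max_samplesort(k):
    """Expected largest of k pieces when k-1 uniform points cut [0, 1]."""
    return sum(Fraction((-1) ** (j - 1) * comb(k - 1, j - 1), j * j)
               for j in range(1, k + 1))

def harmonic(k):
    return sum(Fraction(1, j) for j in range(1, k + 1))
```

Since the largest subfile bounds the parallel sorting time from below, none of these three strategies reaches a speedup of k in the simplest case, which is the point of the comparison with exact median finding.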


If the mean and standard deviation of the time to quicksort a file of size M are
$a_s M \lg(M)$ and $b_s M$, and the mean and standard deviation of the time to find the median
of a file of size M and partition the file using the median as partitioning element are
$a_p M$ and $b_p M$, then from equation 2.3 we find that the mean total execution time is less
than

\[ N \left[ \frac{a_s}{k} \lg\!\left(\frac{N}{k}\right) + 2 a_p\, \frac{k-1}{k} + \frac{b_s}{k} \cdot \frac{k-1}{\sqrt{2k-1}} + b_p \sum_{1 \le j \le \lg(k)-1} \frac{2^j - 1}{2^j \sqrt{2^{j+1} - 1}} \right]. \]


When L is greater than $1 + \lg(k)$ a similar result may be found using equation 2.4.
5 - Static Merge Sort


Consider a static generalization of merge sort as given by figure 5.1 (see [4] for
a discussion of sequential merge sort):


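Figure 5.1 (task-graph: the sorting tasks $S_1, S_2, \ldots, S_{2^{L-1}}$ feed the merging tasks
$M_{2,1}, M_{2,2}, \ldots$, which feed successive levels of merging tasks down to the final merge $M_{L,1}$)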

The tasks may be described as follows, assuming the file to be sorted consists of
records 1 through N:

(1) $S_i$ is a merge sort of all the records strictly between $(i-1)(N/2^{L-1})$ and
$i(N/2^{L-1})+1$.

(2) $M_{2,j}$ is a merge of the two sorted files produced by $S_{2j-1}$ and $S_{2j}$.

(3) $M_{i,j}$ (i>2) is a merge of the two sorted files produced by $M_{i-1,2j-1}$ and
$M_{i-1,2j}$.

When k is a power of 2 and $L = 1 + \lg(k)$, the width of the task graph is k and
equation 2.3 may be applied. Assuming the time to merge sort a file of size N has
mean $a_s N \lg(N)$ and standard deviation $b_s$, and that the time to merge two files of
sizes M and N has mean $a_m(M+N)$ and standard deviation $b_m$ (see [4]), we find that the
mean total execution time is less than

[equation missing in the source]
In the remainder of this section we consider one possible improvement:
replacing the merging tasks with parallel merges. A two task merge of two files is
possible by letting each task be an instance of the usual sequential two-way merge
(see [4]), except that in one task merging begins with the two smallest items of the
two files (a merge from the left), and in the other task merging begins with the two
largest items (a merge from the right). In addition the two tasks are interlinked as
follows: in a sequential two-way merge, the pointers to the files are compared to the
ends of the files; in a two task merge, the pointers of one task are compared to the
pointers of the other task. Because of this, the two tasks finish together almost
exactly, providing one has not already finished before the other starts. We now
assume a sequential two-way merge of two files each of size N takes time $2a_m N$.
Hence a two process merge using the above method would take time $a_m N$.
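The two task merge can be sketched sequentially by interleaving one step of the merge from the left with one step of the merge from the right. This is only an illustration of the pointer discipline: the function name and the strict alternation of steps are assumptions (in the asynchronous setting the two tasks proceed at their own speeds), and each task stops when its pointers meet those of the other task.

```python
def two_ended_merge(a, b):
    """Merge sorted lists a and b with a left task and a right task.

    Returns (merged, left_steps, right_steps). Each task compares its
    pointers with the other task's pointers instead of the file ends.
    """
    out = [None] * (len(a) + len(b))
    la, lb = 0, 0                      # pointers of the merge from the left
    ra, rb = len(a) - 1, len(b) - 1    # pointers of the merge from the right
    lo, hi = 0, len(out) - 1           # next output slots of the two tasks
    left_steps = right_steps = 0
    while lo <= hi:
        # One step of the merge from the left: take the smaller front item.
        if lb > rb or (la <= ra and a[la] <= b[lb]):
            out[lo] = a[la]
            la += 1
        else:
            out[lo] = b[lb]
            lb += 1
        lo += 1
        left_steps += 1
        if lo > hi:
            break
        # One step of the merge from the right: take the larger back item.
        if rb < lb or (ra >= la and a[ra] > b[rb]):
            out[hi] = a[ra]
            ra -= 1
        else:
            out[hi] = b[rb]
            rb -= 1
        hi -= 1
        right_steps += 1
    return out, left_steps, right_steps
```

With equal-speed tasks each one places about half of the output, which is the reason the two process merge is taken to need half the sequential time.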

Next consider the merging algorithm given by figure 5.2, for k = 4:

Figure 5.2 (task-graph over the tasks $I_1$, $I_2$, $Z$, $L_1$, $L_2$, $L_3$, $R_1$, $R_2$, $R_3$)
Assume the elements to be merged are $x_1 < x_2 < \cdots < x_N$ and $y_1 < y_2 < \cdots < y_N$. The tasks
are:

$I_1$: Insert $x_{\lfloor N/2 \rfloor}$ in the y's.

$I_2$: Insert $y_{\lfloor N/2 \rfloor}$ in the x's.

$Z$: The results of the insertions determine three pairs of subfiles, as
shown in figure 5.3. Z determines the subfile pairs and initializes the $L_i$
and $R_i$ tasks.

$L_i$: Merge from the left of the i'th subfile pair.

$R_i$: Merge from the right of the i'th subfile pair.


Figure 5.3 (the three pairs of subfiles determined by the two insertions)

If process 1 executes $L_1$ and process 2 executes $L_2$ and then $R_1$, process 1
finishes before or with process 2. Let the sizes of the subfiles in the second subfile
pair be X and Y. The execution time for process 2, starting at the completion of Z, is
at most

\[ a_m\left(\frac{N}{2} + \frac{|X-Y|}{4}\right), \]

since $(X+Y)/2 \le N/2$. The same result holds for the process executing $R_2$ and $L_3$. In
order to find the distribution of $|X-Y|$, it is assumed all elements $x_i$, $y_j$ are distinct, and
that all permutations are equally likely. Then the probability of inserting $x_{\alpha N}$ in
position i is:

\[ P(y_i < x_{\alpha N} < y_{i+1}) = \binom{i+\alpha N-1}{\alpha N-1} \binom{N(2-\alpha)-i}{(1-\alpha)N} \Big/ \binom{2N}{N} \approx \frac{2\alpha N}{(i+\alpha N)\sqrt{N\pi}}\, e^{-(i-\alpha N)^2/N} \]

using the normal approximation to the binomial distribution. This distribution is again
approximately normal, with mean $\alpha N$ and standard deviation $\sqrt{N/2}$. Assuming X and Y
are actually distributed normally, the mean of $|X-Y|$ can be calculated to be $\sqrt{2N/\pi}$.
Hence,

\[ E(t_G) \le a_m\left(\frac{2N}{k} + \sqrt{\frac{N}{8\pi}}\right) + O(\lg(N)), \]

where the O(lg(N)) term is from the insertion tasks.
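The distribution of the insertion position can also be computed exactly for moderate N, which gives a check on the normal approximation. The sketch below assumes, as in the text, that all interleavings of the N x's and N y's are equally likely; the function names are hypothetical, and `m` stands for the index $\alpha N$ of the inserted element.

```python
from fractions import Fraction
from math import comb

def insertion_pmf(N, m):
    """P(exactly i of the y's are smaller than x_m), for i = 0..N.

    Counts interleavings: i y's and m-1 x's before x_m, the rest after it.
    """
    total = comb(2 * N, N)
    return [Fraction(comb(i + m - 1, m - 1) * comb(2 * N - m - i, N - m), total)
            for i in range(N + 1)]

def mean_and_variance(pmf):
    """Exact mean and variance of a distribution on 0, 1, 2, ..."""
    mean = sum(i * p for i, p in enumerate(pmf))
    second = sum(i * i * p for i, p in enumerate(pmf))
    return mean, second - mean * mean
```

For N = 200 and m = 100 the exact variance is within one percent of N/2, in line with the standard deviation of $\sqrt{N/2}$ used for the case $\alpha = 1/2$.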

Other merging algorithms for k = 4 and for higher k can be devised by using
various element insertion strategies. Similar techniques may be used in their analysis.


6 - Dynamic  Quicksort 


We may use the dynamic algorithm of section 3 for sorting, where tasks are
considered to be subfiles, the decomposition of a task is a partition of the subfile into
two subfiles, and the execution of a task is a sort of the subfile. In analyzing this
algorithm we make the following assumptions, where the file to be sorted contains N
records:

(1) If M is the total number of subfiles to be produced during the
decomposition stage, the total number of task-queue accesses is 3M-2,
and each process makes an approximate average of 3M/k such accesses.
We therefore assume the overhead due to process communication is linear
in M, and is given by w(k)M.

(2) When there are i subfiles, the mean subfile size is N/i. It is assumed
the time needed to partition a subfile is exponentially distributed, and that
when there is a total of i subfiles the mean time is aN/i.

(3) During the task execution phase, the average subfile size is N/M. It is
assumed the time to sort one of the M subfiles produced by
decomposition is exponentially distributed, with mean b(N/M)ln(N/M).

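From equation 3.1, the mean execution time T(M,N,k) is:

\[ T(M,N,k) = w(k)M + \sum_{1 \le i \le k-1} \frac{aN}{i^2} + \sum_{k \le i \le M-1} \frac{aN}{ki} + b\,\frac{N}{M} \ln\!\left(\frac{N}{M}\right)\left(\frac{M}{k} + H_k - 1\right) \]

\[ = w(k)M + \frac{N}{k}\left(b \ln N + a H_{M-1} - a H_{k-1} - b \ln M\right) + b\,\frac{N}{M} \ln\!\left(\frac{N}{M}\right)(H_k - 1) + aN H^{(2)}_{k-1}, \]

where $H^{(2)}_{n} = 1 + 1/2^2 + \cdots + 1/n^2$.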


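Given N and k, we seek to find M so as to minimize T(M,N,k). If we approximate
$H_{M-1}$ by $\ln(M)$, then M must satisfy

\[ \frac{\partial T}{\partial M} = w(k) + \frac{(a-b)N}{kM} - \frac{bN(H_k - 1)}{M^2}\left(\ln\frac{N}{M} + 1\right) = 0. \]

Let

\[ A = \frac{w(k)}{bN(H_k - 1)} \quad \text{and} \quad B = \frac{a-b}{bk(H_k - 1)}; \]

then the optimal value of M is the solution of

\[ \frac{N}{M} = \exp(AM^2 + BM - 1). \]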


A short table of the optimal integer value of M for various values of w(k)/b follows,
for the case k = 4, a = b, N = 10^6:

    w(4)/b    M
    10        930
    10^2      313
    10^3      103
    10^4      35
    10^5      11

Thus, given a, b, N, and k, the optimal degree of decomposition is determined by w(k),
the process communication overhead.
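The optimality condition can be solved numerically. The sketch below assumes the condition takes the form $\ln(N/M) = AM^2 + BM - 1$ with $A = w(k)/(bN(H_k-1))$ and $B = (a-b)/(bk(H_k-1))$, that $k \ge 2$, and that the root lies between 1 and N; the function name and the use of bisection are choices made here for illustration, not the report's method.

```python
from math import log

def optimal_M(N, k, a, b, w):
    """Real-valued root M of ln(N/M) = A*M**2 + B*M - 1, by bisection.

    w is the per-subfile communication overhead w(k); requires k >= 2.
    Assumes the root is bracketed by [1, N].
    """
    H_k_minus_1 = sum(1.0 / j for j in range(1, k + 1)) - 1.0
    A = w / (b * N * H_k_minus_1)
    B = (a - b) / (b * k * H_k_minus_1)

    def f(M):
        return A * M * M + B * M - 1.0 - log(N / M)

    lo, hi = 1.0, float(N)
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For k = 4, a = b and N = 10^6, rounding the root gives 313 for w(4)/b = 10^2 and 35 for w(4)/b = 10^4, matching the corresponding table entries; larger overhead pushes the optimum toward coarser decomposition.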


7 - Summary 


We have classified asynchronous multiprocessor algorithms which employ
problem decomposition as static and dynamic. Static decomposition algorithms require
little process communication and would be well-suited for systems where process
communication is expensive, e.g., "loosely-coupled" computer networks.

A static decomposition algorithm is described by a task-graph. Simple task-graphs
have the property that there is a simple expression for the probability
distribution of total execution time in terms of the probability distributions of each
task, providing the result of one task does not affect the execution time of another. If
the probability distributions of each task's execution time are unknown, it is still
possible to bound mean total execution times providing the means and variances of
task execution times are known.

Regarding the upper bound given by equation 2.3, the bound is tight in that
task-graphs and task execution time probability distributions may be constructed so
that equality holds, using distributions derived in [2]. Any improved bound would
require either more detailed information about the partial ordering of the tasks in the
expression of the bound, or additional assumptions about the probability distributions
of task execution times.
When process communication is inexpensive, dynamic decomposition algorithms
are suitable. One technique for analyzing these algorithms is by means of a queueing
model. Queueing models may be used in analyzing other types of asynchronous
parallel algorithms as well (e.g., in [1] a queueing model is used to analyze
asynchronous iterative methods).

For static decomposition algorithms the bounds derived in section 2 may
be directly applied in some cases, such as static quicksort with median finding and
static merge sort. In other cases, where task execution times are dependent, other
techniques must be used. This is the case for static quicksort when median finding is
not used and in the parallel merging algorithm presented. These algorithms have
dependent task execution times since there are tasks where the input size depends on
the result of a previous task.


The assumption that process communication overhead is negligible in static
decomposition algorithms is valid only if the total number of tasks is not very large.
For this reason we have given bounds on mean execution time only for those
algorithms in which the width of the task-graph is k (although a technique for greater
width task-graphs has also been presented). These bounds give an indication of the
performance that can be expected when process communication overhead is high
enough to warrant the use of static decomposition. However, in dynamic decomposition
algorithms we may choose the degree of decomposition, which should ideally be chosen
so as to balance process communication overhead and adaptability to variations in the
execution times of tasks. For example, by applying a queueing model to a dynamic
generalization of quicksort, we have derived an expression relating process
communication overhead and the optimal degree of decomposition.




References


[1] Baudet, Gerard. "Numerical Computation on Asynchronous Multiprocessors",
Thesis Proposal, Department of Computer Science, Carnegie-Mellon
University, 1976.

[2] David, Herbert A. Order Statistics, Wiley, 1970.

[3] Kleinrock, Leonard. Queueing Systems, vol. 1, Wiley-Interscience, 1975.

[4] Knuth, Donald. The Art of Computer Programming, vol. 3, Addison-Wesley,
1972.

[5] Kung, H. T. "Synchronized and Asynchronous Parallel Algorithms for
Multiprocessors", Algorithms and Complexity - New Directions and Recent
Results, ed. J. F. Traub, pp. 153-200, Academic Press, 1976.

[6] Sedgewick, Robert. Quicksort, Ph.D. Thesis, Computer Science Department,
Stanford University, 1975.

[7] Wulf, W. A., and C. G. Bell. "C.mmp - A Multi-Mini-Processor", Proceedings of
the AFIPS 1972 Fall Joint Computer Conference, vol. 41, pp. 765-777, 1972.


Report Documentation Page

Title: Analysis of Asynchronous Multiprocessor Algorithms with Applications to Sorting
Author: John T. Robinson
Performing organization: Carnegie-Mellon University, Computer Science Dept., Pittsburgh, PA 15213
Controlling office: Office of Naval Research, Arlington, VA 22217
Type of report: Interim
Contract or grant numbers: MCS75-222-55; N00014-76-C-0370; NR 044-422
Report date: July 1977
Number of pages: 21
Security classification: Unclassified
Distribution statement: Approved for public release; distribution unlimited.